[DeepLearning_Tech] Word2vec

References:

  1. https://programmers.co.kr/learn/courses/21/lessons/1697 (running the examples from instructor Cho-eun Park's course on the Programmers site)

  2. https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-2-word-vectors (using the Bag of Words Meets Bags of Popcorn data)

Loading the data files

In [1]:
# import pandas to read the external data files
import pandas as pd
In [2]:
# load the train file
train = pd.read_csv('Data/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
In [3]:
# load the test file
test = pd.read_csv('Data/testData.tsv', header=0, delimiter='\t', quoting=3)
In [4]:
# load the unlabeled training data
unlabeled_train = pd.read_csv('Data/unlabeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
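
A note on these read_csv calls: quoting=3 is simply the integer value of csv.QUOTE_NONE, which stops pandas from treating the double quotes inside the reviews as field delimiters. Using the named constant makes the intent explicit (an equivalent call, shown for the train file):

import csv
# csv.QUOTE_NONE == 3, so this matches quoting=3 above
train = pd.read_csv('Data/labeledTrainData.tsv', header=0,
                    delimiter='\t', quoting=csv.QUOTE_NONE)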

Checking the data structure

In [5]:
# shape of the train data
print(train.shape)
(25000, 3)
In [6]:
# shape of the test data
print(test.shape)
(25000, 2)
In [7]:
# shape of the unlabeled data
print(unlabeled_train.shape)
(50000, 2)
In [8]:
print(train['review'].size)
25000
In [9]:
print(test['review'].size)
25000
In [10]:
print(unlabeled_train['review'].size)
50000
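
Together the three files hold 100,000 reviews. Only the train and unlabeled sets (75,000 reviews) feed the word2vec training below; the test set is not used for training here. A quick sanity check:

# 25000 + 25000 + 50000 = 100000 reviews in total
print(train.shape[0] + test.shape[0] + unlabeled_train.shape[0])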
In [11]:
# preview the data
train.head()
Out[11]:
id sentiment review
0 "5814_8" 1 "With all this stuff going down at the moment ...
1 "2381_9" 1 "\"The Classic War of the Worlds\" by Timothy ...
2 "7759_3" 0 "The film starts with a manager (Nicholas Bell...
3 "3630_4" 0 "It must be assumed that those who praised thi...
4 "9495_8" 1 "Superbly trashy and wondrously unpretentious ...
In [12]:
test.head()
Out[12]:
id review
0 "12311_10" "Naturally in a film who's main themes are of ...
1 "8348_2" "This movie is a disaster within a disaster fi...
2 "5828_4" "All in all, this is a movie for kids. We saw ...
3 "7186_2" "Afraid of the Dark left me with the impressio...
4 "12128_7" "A very accurate depiction of small time mob l...
In [13]:
unlabeled_train.head()
Out[13]:
id review
0 "9999_0" "Watching Time Chasers, it obvious that it was...
1 "45057_0" "I saw this film about 20 years ago and rememb...
2 "15561_0" "Minor Spoilers<br /><br />In New York, Joan B...
3 "7161_0" "I went to see this film with a great deal of ...
4 "43971_0" "Yes, I agree with everyone on this site this ...

Modeling with Kaggle word2vec

In [14]:
# preprocessing utilities from the Kaggle word2vec tutorial
from KaggleWord2VecUtility import KaggleWord2VecUtility
In [15]:
# word list for the first train review
KaggleWord2VecUtility.review_to_wordlist(train['review'][0])[:10]
Out[15]:
['with', 'all', 'this', 'stuff', 'go', 'down', 'at', 'the', 'moment', 'with']
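
The KaggleWord2VecUtility script itself is not shown in this post. Judging from the output (lowercase tokens, and stems like 'go' here and 'documentari' below), it strips the HTML, keeps only letters, lowercases, and applies a stemmer. A minimal sketch of what review_to_wordlist plausibly does (the function name and details here are my reconstruction, not the actual script; the stopword list needs nltk.download('stopwords')):

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

def review_to_wordlist_sketch(review, remove_stopwords=False):
    # 1. strip HTML tags, 2. keep letters only, 3. lowercase and split
    text = BeautifulSoup(review, "html.parser").get_text()
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    # stems like 'documentari' suggest a Snowball-style stemmer is applied
    stemmer = SnowballStemmer("english")
    return [stemmer.stem(w) for w in words]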
In [16]:
# split every train review into tokenized sentences
sentences = []
for review in train["review"]:
    sentences += KaggleWord2VecUtility.review_to_sentences(review, remove_stopwords=False)
C:\Users\BowlMin\Anaconda3\envs\py35\lib\site-packages\bs4\__init__.py:294: UserWarning: "b'.'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  ' Beautiful Soup.' % markup)
C:\Users\BowlMin\Anaconda3\envs\py35\lib\site-packages\bs4\__init__.py:294: UserWarning: "b'...'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  ' Beautiful Soup.' % markup)
C:\Users\BowlMin\Anaconda3\envs\py35\lib\site-packages\bs4\__init__.py:357: UserWarning: "http://www.happierabroad.com"" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  ' that document to Beautiful Soup.' % decoded_markup
In [17]:
sentences[0]
Out[17]:
['with',
 'all',
 'this',
 'stuff',
 'go',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',
 've',
 'start',
 'listen',
 'to',
 'his',
 'music',
 'watch',
 'the',
 'odd',
 'documentari',
 'here',
 'and',
 'there',
 'watch',
 'the',
 'wiz',
 'and',
 'watch',
 'moonwalk',
 'again']
In [18]:
# append sentences from the unlabeled reviews as well
# (word2vec training is unsupervised, so unlabeled text still helps)
for review in unlabeled_train["review"]:
    sentences += KaggleWord2VecUtility.review_to_sentences(
        review, remove_stopwords=False)
C:\Users\BowlMin\Anaconda3\envs\py35\lib\site-packages\bs4\__init__.py:294: UserWarning: "b'.'" looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.
  ' Beautiful Soup.' % markup)
(... several more bs4 UserWarnings of the same two kinds, "looks like a filename" and "looks like a URL", omitted ...)
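
The bs4 warnings above are harmless: they fire whenever a "sentence" handed to BeautifulSoup is just punctuation or a bare URL. As for review_to_sentences, it presumably splits each review into sentences with NLTK's punkt tokenizer and runs each sentence through the word-list function. A rough sketch under that assumption (requires nltk.download('punkt')):

import nltk.data

punkt = nltk.data.load('tokenizers/punkt/english.pickle')

def review_to_sentences_sketch(review, remove_stopwords=False):
    # split the raw review into sentences, then tokenize each one
    raw_sentences = punkt.tokenize(review.strip())
    return [review_to_wordlist_sketch(s, remove_stopwords)
            for s in raw_sentences if len(s) > 0]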
In [19]:
# asctime (timestamp), levelname (log level), message
# print status logs so the training progress is visible
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
In [20]:
# set the hyperparameters
num_features = 300    # word vector dimensionality
min_word_count = 40   # minimum word count (10-100 works well; limits the vocabulary to meaningful words)
num_workers = 4       # number of parallel worker threads
context = 10          # context window size (how many surrounding words to consider)
downsampling = 1e-3   # downsampling of frequent words (Google's docs recommend values between 0.00001 and 0.001)
In [21]:
# import word2vec for model initialization and training
from gensim.models import word2vec
2019-07-25 20:27:17,319 : INFO : 'pattern' package not found; tag filters are not available for English
In [22]:
# train the model
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)
2019-07-25 20:27:17,336 : INFO : collecting all words and their counts
2019-07-25 20:27:17,340 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
(... per-10,000-sentence PROGRESS lines omitted ...)
2019-07-25 20:27:22,727 : INFO : collected 86997 word types from a corpus of 17798269 raw words and 795538 sentences
2019-07-25 20:27:22,728 : INFO : Loading a fresh vocabulary
2019-07-25 20:27:22,836 : INFO : effective_min_count=40 retains 11986 unique words (13% of original 86997, drops 75011)
2019-07-25 20:27:22,837 : INFO : effective_min_count=40 leaves 17434031 word corpus (97% of original 17798269, drops 364238)
2019-07-25 20:27:22,910 : INFO : deleting the raw counts dictionary of 86997 items
2019-07-25 20:27:22,915 : INFO : sample=0.001 downsamples 50 most-common words
2019-07-25 20:27:22,916 : INFO : downsampling leaves estimated 12872362 word corpus (73.8% of prior 17434031)
2019-07-25 20:27:22,990 : INFO : estimated required memory for 11986 words and 300 dimensions: 34759400 bytes
2019-07-25 20:27:22,991 : INFO : resetting layer weights
2019-07-25 20:27:23,266 : INFO : training model with 4 workers on 11986 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=10
(... per-second EPOCH PROGRESS and worker-thread shutdown lines omitted ...)
2019-07-25 20:27:46,490 : INFO : EPOCH - 1 : training on 17798269 raw words (12874652 effective words) took 23.2s, 554793 effective words/s
2019-07-25 20:28:09,559 : INFO : EPOCH - 2 : training on 17798269 raw words (12873701 effective words) took 23.0s, 558567 effective words/s
2019-07-25 20:28:33,018 : INFO : EPOCH - 3 : training on 17798269 raw words (12872526 effective words) took 23.4s, 549378 effective words/s
2019-07-25 20:28:56,158 : INFO : EPOCH - 4 : training on 17798269 raw words (12873230 effective words) took 23.1s, 556613 effective words/s
2019-07-25 20:29:19,315 : INFO : EPOCH - 5 : training on 17798269 raw words (12873333 effective words) took 23.1s, 556345 effective words/s
2019-07-25 20:29:19,319 : INFO : training on a 88991345 raw words (64367442 effective words) took 116.1s, 554641 effective words/s
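
A portability note: this run used an older gensim (3.x, in a Python 3.5 env). In gensim 4.x the size parameter was renamed to vector_size (and iter to epochs), so the equivalent call today would look roughly like this (a sketch, not run in this notebook):

# gensim >= 4.0: size -> vector_size
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          vector_size=num_features,
                          min_count=min_word_count,
                          window=context, sample=downsampling)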
In [23]:
model
Out[23]:
<gensim.models.word2vec.Word2Vec at 0x1e7e0ab1390>
In [24]:
# after training, precompute the normalized vectors and free memory that is no longer needed
model.init_sims(replace=True)
2019-07-25 20:29:19,347 : INFO : precomputing L2-norms of word weight vectors
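
init_sims is deprecated in gensim 4.x, where normalized vectors are computed lazily instead. If you explicitly need a unit-length vector there, something like this should work (an assumption based on the 4.x KeyedVectors API):

# gensim >= 4.0: request a unit-normalized vector directly
vec = model.wv.get_vector('man', norm=True)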
In [25]:
model_name = '300features_40minwords_10text'
In [26]:
# save the model to disk
model.save(model_name)
2019-07-25 20:29:19,396 : INFO : saving Word2Vec object under 300features_40minwords_10text, separately None
2019-07-25 20:29:19,398 : INFO : not storing attribute vectors_norm
2019-07-25 20:29:19,401 : INFO : not storing attribute cum_table
2019-07-25 20:29:19,958 : INFO : saved 300features_40minwords_10text
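
The saved model can be loaded back later without retraining, along these lines (the straightforward way; the visualization section below loads it slightly differently):

# reload the trained model from disk
model = word2vec.Word2Vec.load('300features_40minwords_10text')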

Running the word2vec model

In [27]:
# find the word least similar to the others
model.wv.doesnt_match('man woman child kitchen'.split())
C:\Users\BowlMin\Anaconda3\envs\py35\lib\site-packages\gensim\models\keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)
Out[27]:
'kitchen'
In [28]:
model.wv.doesnt_match("france england germany berlin".split())
2019-07-25 20:29:19,984 : WARNING : vectors for words {'france', 'germany'} are not present in the model, ignoring these words
C:\Users\BowlMin\Anaconda3\envs\py35\lib\site-packages\gensim\models\keyedvectors.py:877: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)
Out[28]:
'berlin'
In [29]:
# words most similar to a given word
model.wv.most_similar("man")
Out[29]:
[('woman', 0.6385219097137451),
 ('millionair', 0.49444377422332764),
 ('businessman', 0.4919447898864746),
 ('ladi', 0.4828992187976837),
 ('widow', 0.4747314751148224),
 ('billionair', 0.4741416275501251),
 ('farmer', 0.4725267291069031),
 ('lad', 0.4719681143760681),
 ('lawyer', 0.46801966428756714),
 ('men', 0.4660264849662781)]
In [30]:
model.wv.most_similar("queen")
Out[30]:
[('princess', 0.6024907231330872),
 ('goddess', 0.5887964963912964),
 ('victoria', 0.5526267290115356),
 ('stepmoth', 0.5463555455207825),
 ('eva', 0.5462982654571533),
 ('anita', 0.5314058661460876),
 ('madam', 0.5280230045318604),
 ('dame', 0.5271132588386536),
 ('seductress', 0.5239138603210449),
 ('mistress', 0.5218404531478882)]
In [31]:
model.wv.most_similar("film")
Out[31]:
[('movi', 0.8594891428947449),
 ('flick', 0.6227060556411743),
 ('documentari', 0.5657299160957336),
 ('pictur', 0.549752950668335),
 ('cinema', 0.5217683911323547),
 ('it', 0.49816834926605225),
 ('sequel', 0.4946991801261902),
 ('masterpiec', 0.4758005440235138),
 ('effort', 0.467248797416687),
 ('genr', 0.46577876806259155)]
In [32]:
model.wv.most_similar("happi")
Out[32]:
[('unhappi', 0.42433738708496094),
 ('satisfi', 0.41146549582481384),
 ('sad', 0.39781808853149414),
 ('glad', 0.3798343241214752),
 ('uplift', 0.3791547417640686),
 ('bitter', 0.3744077682495117),
 ('upset', 0.372517466545105),
 ('afraid', 0.36562544107437134),
 ('joy', 0.3651914894580841),
 ('comfort', 0.36515605449676514)]
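
most_similar also takes positive and negative word lists, which gives the classic word-analogy behavior. Since this vocabulary is stemmed, the query words must be stems too; a query like the following (not run here) should surface 'queen'-like words:

# king - man + woman ~= queen (classic analogy query)
model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=5)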

Visualizing the word2vec word vectors

In [33]:
from sklearn.manifold import TSNE
import matplotlib as mpl
import matplotlib.pyplot as plt
import gensim
import gensim.models as g
In [34]:
model_name = '300features_40minwords_10text'
In [35]:
# load the saved model (Doc2Vec.load simply unpickles the saved object here;
# word2vec.Word2Vec.load(model_name) would be the more natural call)
model = g.Doc2Vec.load(model_name)
2019-07-25 20:29:20,444 : INFO : loading Doc2Vec object from 300features_40minwords_10text
2019-07-25 20:29:20,806 : INFO : loading wv recursively from 300features_40minwords_10text.wv.* with mmap=None
2019-07-25 20:29:20,807 : INFO : setting ignored attribute vectors_norm to None
2019-07-25 20:29:20,808 : INFO : loading trainables recursively from 300features_40minwords_10text.trainables.* with mmap=None
2019-07-25 20:29:20,809 : INFO : loading vocabulary recursively from 300features_40minwords_10text.vocabulary.* with mmap=None
2019-07-25 20:29:20,810 : INFO : setting ignored attribute cum_table to None
2019-07-25 20:29:20,811 : INFO : loaded 300features_40minwords_10text
In [36]:
# get the model's vocabulary as a list
vocab = list(model.wv.vocab)
In [37]:
# look up the embedding vector for each word
X = model[vocab]
C:\Users\BowlMin\Anaconda3\envs\py35\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).

In [38]:
print(X)
[[ 0.07133124 -0.06533014  0.0060728  ... -0.02894855 -0.0623975
   0.07982507]
 [-0.02169806  0.03199218 -0.03792744 ... -0.0440563   0.06800202
  -0.01793549]
 [ 0.09001943  0.0025404   0.0137266  ...  0.02830658 -0.01290188
   0.11468274]
 ...
 [-0.07605617  0.05933593  0.01443341 ... -0.07376457  0.06024652
  -0.04897078]
 [-0.09916275  0.00330711  0.04470911 ... -0.07864594 -0.04736565
   0.03975831]
 [-0.07319105 -0.08357555  0.02091095 ... -0.02201691 -0.03904636
   0.0109644 ]]
In [39]:
print(len(X))
11986
In [40]:
print(X[0][:10])
[ 0.07133124 -0.06533014  0.0060728   0.03319211 -0.07066774 -0.01097665
 -0.00853973 -0.01172582  0.04873645  0.02962803]
In [41]:
tsne = TSNE(n_components=2)  # reduce to a 2-D embedding space
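
t-SNE is stochastic, so each run produces a different layout. For a reproducible plot you can fix the seed (an optional tweak, not used in this run):

# fixing random_state makes the 2-D layout reproducible across runs
tsne = TSNE(n_components=2, random_state=0)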
In [42]:
X_tsne = tsne.fit_transform(X[:100,:])  # project the first 100 word vectors down to 2-D
In [43]:
df = pd.DataFrame(X_tsne, index=vocab[:100], columns=['x', 'y'])
In [44]:
df.shape
Out[44]:
(100, 2)
In [45]:
df.head(10)
Out[45]:
x y
languag 3.289710 2.189329
outlandish 4.910164 -0.631405
lukewarm 6.514871 -1.092859
carnag 1.459178 -0.022326
falter 5.982720 -0.108871
k 2.890386 -5.067970
expend 0.582870 -1.119917
astound 5.645493 1.833828
crumb -6.061647 -1.190784
unpredict 5.001503 -0.318403
In [46]:
fig = plt.figure()  # create the figure
fig.set_size_inches(40, 20)
ax = fig.add_subplot(1, 1, 1)  # rows, columns, subplot index

ax.scatter(df['x'], df['y'])  # draw each word as a point

for word, pos in df.iterrows():
    ax.annotate(word, pos, fontsize=30)  # label each point with its word

plt.show()

Modeling with the Moby Dick text file

In [47]:
# build a model from a different corpus
import re
import numpy as np
In [48]:
# read the raw text and split it into rough sentences on newlines, periods, and question marks
with open("Data/mobydick.txt", "r") as file:
    moby_dick = file.read()
moby_dick = re.split("[\n\.?]", moby_dick)
In [49]:
# drop the empty strings left over from the split
while '' in moby_dick:
    moby_dick.remove('')
In [50]:
# strip commas, semicolons, and double quotes, then split each sentence into words
MobyDick = pd.DataFrame()
MobyDick['sentences'] = np.asarray(moby_dick)
MobyDick['sentences_separated'] = MobyDick['sentences'].apply(lambda x: x.replace(",", ""))
MobyDick['sentences_separated'] = MobyDick['sentences_separated'].apply(lambda x: x.replace(";", ""))
MobyDick['sentences_separated'] = MobyDick['sentences_separated'].apply(lambda x: x.replace("\"", ""))
MobyDick['sentences_separated'] = MobyDick['sentences_separated'].apply(lambda x: x.split())
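
The three replace calls can also be collapsed into a single regex pass; this one-liner produces the same token lists for these characters:

# remove commas, semicolons, and double quotes in one pass, then tokenize
MobyDick['sentences_separated'] = MobyDick['sentences'].apply(
    lambda x: re.sub('[,;"]', '', x).split())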
In [51]:
# hs=1: train with hierarchical softmax
model_moby = word2vec.Word2Vec(MobyDick['sentences_separated'],
                               hs=1, size=300, min_count=5)
2019-07-25 20:29:24,298 : INFO : collecting all words and their counts
2019-07-25 20:29:24,371 : INFO : collected 15556 word types from a corpus of 112814 raw words and 16985 sentences
2019-07-25 20:29:24,372 : INFO : Loading a fresh vocabulary
2019-07-25 20:29:24,391 : INFO : effective_min_count=5 retains 2472 unique words (15% of original 15556, drops 13084)
2019-07-25 20:29:24,392 : INFO : effective_min_count=5 leaves 92969 word corpus (82% of original 112814, drops 19845)
2019-07-25 20:29:24,416 : INFO : sample=0.001 downsamples 57 most-common words
2019-07-25 20:29:24,418 : INFO : downsampling leaves estimated 65559 word corpus (70.5% of prior 92969)
2019-07-25 20:29:24,426 : INFO : constructing a huffman tree from 2472 words
2019-07-25 20:29:24,539 : INFO : built huffman tree with maximum node depth 14
2019-07-25 20:29:24,629 : INFO : training model with 3 workers on 2472 vocabulary and 300 features, using sg=0 hs=1 sample=0.001 negative=5 window=5
(... per-epoch PROGRESS and worker-thread lines omitted ...)
2019-07-25 20:29:24,890 : INFO : EPOCH - 1 : training on 112814 raw words (65590 effective words) took 0.3s, 259137 effective words/s
2019-07-25 20:29:25,156 : INFO : EPOCH - 2 : training on 112814 raw words (65673 effective words) took 0.2s, 265168 effective words/s
2019-07-25 20:29:25,431 : INFO : EPOCH - 3 : training on 112814 raw words (65563 effective words) took 0.3s, 256356 effective words/s
2019-07-25 20:29:25,711 : INFO : EPOCH - 4 : training on 112814 raw words (65753 effective words) took 0.3s, 257072 effective words/s
2019-07-25 20:29:25,987 : INFO : EPOCH - 5 : training on 112814 raw words (65683 effective words) took 0.3s, 245945 effective words/s
2019-07-25 20:29:25,990 : INFO : training on a 564070 raw words (328262 effective words) took 1.4s, 241225 effective words/s

Loading the pretrained Google News model

In [52]:
# load Google's pretrained news vectors (3 million words, 300 dimensions)
google_trained = gensim.models.KeyedVectors.load_word2vec_format(
    'Data/GoogleNews-vectors-negative300.bin', binary=True)
2019-07-25 20:29:26,002 : INFO : loading projection weights from Data/GoogleNews-vectors-negative300.bin
2019-07-25 20:31:40,610 : INFO : loaded (3000000, 300) matrix from Data/GoogleNews-vectors-negative300.bin
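
The full matrix is large (3,000,000 x 300 float32 values is roughly 3.6 GB), which explains the two-minute load time above. If memory is tight, load_word2vec_format accepts a limit argument to read only the most frequent words:

# load only the 500,000 most frequent words to save memory
google_small = gensim.models.KeyedVectors.load_word2vec_format(
    'Data/GoogleNews-vectors-negative300.bin', binary=True, limit=500000)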

Comparing similarities across the models

In [53]:
# words most similar to 'whale' in the Moby Dick model
for word, score in model_moby.wv.most_similar("whale"):
    print(word)
2019-07-25 20:31:40,712 : INFO : precomputing L2-norms of word weight vectors
High
over
ness
three
lance
such
bore
months
those
into
In [54]:
# words most similar to 'whale' in the Kaggle model
for word, score in model.wv.most_similar("whale"):
    print(word)
2019-07-25 20:31:40,833 : INFO : precomputing L2-norms of word weight vectors
squid
snowman
zodiac
snake
reptil
giant
driller
shrew
gorilla
tomato
In [55]:
# words most similar to 'whale' in the Google News model
for word, score in google_trained.wv.most_similar("whale"):
    print(word)
C:\Users\BowlMin\Anaconda3\envs\py35\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated `wv` (Attribute will be removed in 4.0.0, use self instead).

2019-07-25 20:31:41,054 : INFO : precomputing L2-norms of word weight vectors
whales
humpback_whale
dolphin
humpback
minke_whale
humpback_whales
dolphins
humpbacks
shark
orca
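
The pattern is clear: the tiny Moby Dick corpus yields essentially noise, the Kaggle model finds loosely related animals, and the pretrained Google News vectors return clean whale-related neighbors. The loops above throw away the cosine similarity scores, though; printing them alongside the words makes the gap even easier to see (a small follow-up reusing the models already in memory):

# print word and cosine similarity side by side for each model
for name, vectors in [('mobydick', model_moby.wv),
                      ('kaggle', model.wv),
                      ('google news', google_trained)]:
    print(name)
    for word, score in vectors.most_similar('whale', topn=3):
        print('  {}: {:.3f}'.format(word, score))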